Full Disclosure: The majority of this project came from a gun deaths visualization project I completed at Emory University. The goal of the original project was to run exploratory analyses on the dataset in R and then create a meaningful visualization in D3.js that would capture one of my interesting findings. This R Markdown file was modified to fit the requirements of Udacity’s Exploratory Data Analysis project, with a few new plots added in and some superfluous tables and graphs removed.
This dataset, which contains information on gun deaths in the United States from 2012 to 2014, came from FiveThirtyEight’s Gun Deaths in America project, which can be viewed here: (https://fivethirtyeight.com/features/gun-deaths/). The full dataset is available on GitHub (https://github.com/fivethirtyeight/guns-data).
For my project, I focused on suicides and accidental deaths from a gun. My reason for this focus was to show that while mass shootings are an important issue, there are other ways that guns can kill and other reasons they should be regulated.
The variables of the dataset are:
To make my analyses more interesting, I created three new variables:
year-month - factor; the year and month of a gun incident combined (“01-2012”, “02-2012”, …, “12-2014”). Used to look at gun deaths over time (rather than just by month or year).
The age divisions were created roughly based on the ones created in this chart (http://www.widener.edu/about/campus_resources/wolfgram_library/documents/life_span_chart_final.pdf). I created this variable in order to more clearly see differences among age groups in terms of gun deaths.
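A minimal sketch of how the year-month and age group variables could be built in base R. The data frame name `guns` and the column names `year`, `month`, and `age` are assumptions for illustration, and this is not necessarily the code used in the original analysis. The age group labels are the ones that appear in the tables of this report.

```r
# Toy stand-in for the real data; the names guns, year, month, and age are assumptions.
guns <- data.frame(
  year  = c(2012, 2012, 2013, 2014),
  month = c(1, 2, 7, 12),
  age   = c(2, 16, 45, 80)
)

# All 36 month-year labels in chronological order, so plots sort by time.
ym_levels <- sprintf("%02d-%d", rep(1:12, times = 3), rep(2012:2014, each = 12))
guns$year_month <- factor(
  sprintf("%02d-%d", guns$month, guns$year),
  levels = ym_levels
)

# cut() closes each interval on the right by default, so age 2 falls in
# "2 and under" and age 75 would fall in "75 and older".
guns$age_group <- cut(
  guns$age,
  breaks = c(-Inf, 2, 12, 17, 24, 39, 59, 74, Inf),
  labels = c("2 and under", "3 to 12", "13 to 17", "18 to 24",
             "25 to 39", "40 to 59", "60 to 74", "75 and older")
)

print(guns)
```

Fixing the factor levels up front keeps the months in calendar order instead of alphabetical order when the variable is plotted.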
I created a few tables to get a sense of the full dataset before dividing it into dataframes for suicides and accidental deaths. The following are tables of some important variables in the dataset, showing the proportions of gun deaths for the different values of those variables. The tables are followed by bar charts to provide a visual aid for those proportions.
##
## F M
## 0.1433461 0.8566539
The victims were 86% Male and 14% Female.
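Proportion tables like the ones in this section can be produced by passing a frequency table to `prop.table()`. The sketch below uses a small made-up vector rather than the real data, so its numbers do not match the table above.

```r
# Toy vector of sex codes; the values are made up for illustration.
sex <- c("M", "M", "F", "M", "M", "M", "F", "M", "M", "M")

counts <- table(sex)         # frequency of each value
props  <- prop.table(counts) # frequencies divided by the total

print(props)
print(round(100 * props))    # whole-number percentages for reporting
```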
##
##         Asian/Pacific Islander                          Black
##                    0.013155023                    0.231115697
##                       Hispanic Native American/Native Alaskan
##                    0.089505744                    0.009097403
##                          White
##                    0.657126133
The victims were 1% Asian/Pacific Islander, 23% Black, 9% Hispanic, 1% Native American/Native Alaskan, and 66% White.
##
##          BA+       HS/GED Less than HS Some college
##    0.1302729    0.4319655    0.2196003    0.2181613
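For education levels, 13% of victims had a BA or higher, 43% had a high school diploma or GED, 22% had less than a high school education, and 22% had some college.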
##
## Accidental Homicide Suicide Undetermined
## 0.016260405 0.348978640 0.626754765 0.008006191
The deaths were 2% accidental, 35% homicidal, 63% suicidal, and less than 1% undetermined.
##
## Spring Summer Fall Winter
## 0.2521181 0.2623961 0.2495784 0.2359075
When looking at deaths by season, 25% of deaths occurred in the spring, 26% occurred in the summer, 25% occurred in the fall, and 24% occurred in the winter.
##
## 0 1
## 0.98609099 0.01390901
99% of gun death victims were not police officers. No histogram was needed.
##
##                    Farm                    Home Industrial/construction
##             0.004727704             0.608425373             0.002494618
##         Other specified       Other unspecified Residential institution
##             0.138320558             0.089192669             0.002041966
##       School/instiution                  Sports                  Street
##             0.006749552             0.001287545             0.112167300
##      Trade/service area
##             0.034592713
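For location of death, 61% of deaths occurred at home, 14% at other specified locations, 11% on the street, 9% at other unspecified locations, and 3% in trade/service areas. Farms, industrial/construction sites, residential institutions, schools/institutions, and sports locations each accounted for less than 1%.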
##
## 2012 2013 2014
## 0.3329729 0.3336971 0.3333300
The year variable is divided evenly, with 33% of deaths in 2012, 33% of deaths in 2013, and 33% of deaths in 2014.
Proportion tables were not as relevant for month and year-month, but a histogram was created for month and a bar chart was created for year-month to help visualize their distributions.
##
##  2 and under      3 to 12     13 to 17     18 to 24     25 to 39
##  0.001200635  0.005616194  0.031782100  0.160369121  0.267116491
##     40 to 59     60 to 74 75 and older
##  0.309257789  0.142885493  0.081772177
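The breakdown of victims by age group is as follows: less than 1% were 2 and under, 1% were 3 to 12, 3% were 13 to 17, 16% were 18 to 24, 27% were 25 to 39, 31% were 40 to 59, 14% were 60 to 74, and 8% were 75 and older.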
For the age variable, I looked at some summary statistics and created a histogram.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 27.00 42.00 43.86 58.00 107.00 18
A histogram of death counts by age shows that, for the full dataset, death counts rise and then fall as age increases, giving a roughly bell-shaped distribution with a positive skew. The peak of this graph occurs around ages 15 to 30. This makes sense, as the 18 to 24 and 25 to 39 age groups together account for about 43% of the total deaths.
I then divided the dataset into dataframes for suicides and accidental deaths to see how they compared.
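A minimal sketch of this split in base R, assuming a data frame called `guns` (a hypothetical name). The `intent` variable and its values "Suicide" and "Accidental" match the proportion table shown earlier, and the rows are made up.

```r
# Toy stand-in for the full dataset; guns is a hypothetical name.
guns <- data.frame(
  intent = c("Suicide", "Homicide", "Accidental", "Suicide", "Undetermined"),
  age    = c(51, 23, 4, 67, 30),
  stringsAsFactors = FALSE
)

# One dataframe per intent of interest.
suicides  <- subset(guns, intent == "Suicide")
accidents <- subset(guns, intent == "Accidental")

print(nrow(suicides))
print(nrow(accidents))
```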
When death counts are plotted over time (using the year-month variable) for both suicides and accidental deaths, the counts fluctuate from month to month, but neither dataframe shows a clear upward or downward trend across the three years.
Instead, let’s see what happens when we look at counts for each month (so all deaths from January are aggregated for 2012, 2013, and 2014, for example).
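As a rough sketch of this aggregation, counting deaths by calendar month pools the three years together. The data frame `deaths` and its columns `year` and `month` are assumed names, and the rows are made up.

```r
# Toy rows: one row per death; names and values are illustrative only.
deaths <- data.frame(
  year  = c(2012, 2012, 2013, 2013, 2014, 2014),
  month = c(1, 6, 1, 6, 1, 12)
)

# Fixing the levels to 1:12 keeps months with zero deaths in the table.
by_month <- table(factor(deaths$month, levels = 1:12))

print(by_month)
```

Here the January count of 3 combines one death from each of 2012, 2013, and 2014.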
For suicides, we see a pretty uniform distribution. For accidental deaths, there isn’t really a clear trend either, except that maybe the summer months and the end of the year months have more deaths. Let’s look at accidental deaths by season to see if anything interesting shows up.
It looks like the winter months have the most accidental deaths, but not by that much.
Taking the focus away from time of year, I was interested in seeing whether police officer deaths played a big role in either category.
##
## 0
## 63175
##
## 0
## 1639
The first result is from suicides; the second is from accidental deaths. The police variable is an indicator variable: a “0” denotes that the victim was not a police officer, while a “1” denotes that the victim was. These tables seem to say that no police officers died by suicide or were killed accidentally. While these results are surprising, they actually make sense if you read where the data was gathered from on FiveThirtyEight’s website: (https://fivethirtyeight.com/features/gun-deaths/). One source of the data was shootings committed by police officers, while another was police officers being killed in the line of duty. Given the nature of both sources, police officers dying by suicide or in an accidental shooting would likely not be included in this dataset, even though these incidents probably occurred in those years.
To investigate the source of these police officer deaths further, I created a dataframe of police deaths and a proportion table for the “intent” variable in that dataframe. Here are the results:
##
## Homicide
## 1
We can see that 100% of the deaths of police officers in this dataset are from homicidal shootings.
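The police check above could be sketched as follows. The data frame name `guns` is an assumption, the `police` and `intent` variables are the ones described in the text, and the rows are made up.

```r
# Toy stand-in for the full dataset; guns is a hypothetical name.
guns <- data.frame(
  intent = c("Suicide", "Homicide", "Homicide", "Accidental", "Homicide"),
  police = c(0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Counts of the police indicator within suicides only.
print(table(guns$police[guns$intent == "Suicide"]))

# Keep only police officer deaths, then tabulate intent as proportions.
officers     <- subset(guns, police == 1)
intent_props <- prop.table(table(officers$intent))

print(intent_props)
```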
Returning to age, let’s remind ourselves of the statistics we found for the full dataset:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 27.00 42.00 43.86 58.00 107.00 18
Now let’s compare them to the statistics for suicides and accidental deaths in that order:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 35.00 51.00 50.31 64.00 102.00 7
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 21.00 35.00 38.04 54.00 96.00 1
We can see that the minimum, quartiles, and mean are noticeably higher for the suicides dataframe than for the accidental deaths dataframe and the full dataset; only its maximum of 102 falls below the full dataset’s 107. This is likely because suicide rates are low among young children and high among middle-aged men (which will be discussed later). In contrast, many more young people are killed accidentally by a gun, which is why the third quartile is only 54 (this will also be discussed later on). Apart from the shared minimum of 0, the summary statistics for accidental deaths are all lower than those of the full dataset.
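A small sketch of this kind of comparison, using made-up age vectors rather than the real dataframes. `summary()` reports missing ages in an `NA's` column, while `median()` needs `na.rm = TRUE` to skip them.

```r
# Made-up ages for illustration; NA marks a victim with unknown age.
suicide_age  <- c(35, 51, 64, NA, 48)
accident_age <- c(3, 21, 35, 54, 19)

print(summary(suicide_age))   # includes an NA's count
print(summary(accident_age))

# Side-by-side medians, dropping missing values explicitly.
medians <- c(
  suicides  = median(suicide_age, na.rm = TRUE),
  accidents = median(accident_age)
)
print(medians)
```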
While these statistics are helpful, it would also be beneficial to zoom in and see the death counts at individual ages for both dataframes and look for any trends over the ages. First, let’s look at suicides:
Interesting. The graph seems to rise steadily until about age 22 and to fall steadily from around age 70.
When looking at age groups instead of individual ages, this trend basically holds up, and the distribution looks fairly normal. The peak at ages 40 to 59 is consistent with the line graph, but I expected the 18 to 24 group to have a larger bar based on the graph above.
What about the accidental deaths?
This graph seems to peak more in certain areas, though there is still a general decline after age 22.
This distribution looks about normal too, albeit negatively skewed. What’s surprising about it is that it shows a peak at the same age group as the suicides dataframe, even though the line plot shows a peak much earlier (closer to ages 18 to 24). Still, the bars for the younger ages are higher than those in the graph of suicides.
Along similar lines, are there certain months of the year where people of certain ages are killed more? And does this vary by race?
Across both dataframes and for all years, there does not seem to be any relationship between month and age. There is no clear trend among races either.
Let’s take a look at the correlation between month and age (using Pearson’s r) for both dataframes to confirm whether there is really no relationship.
First suicides:
## [1] 0.004586628
That’s almost zero correlation, which means there is close to no linear relationship.
Let’s look at accidental deaths:
## [1] 0.01071504
This correlation is not as small, but it is still close to zero. No linear relationship here either.
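A minimal sketch of the correlation step with made-up values. Because the age variable contains missing values, the sketch passes `use = "complete.obs"` to `cor()`, and whether the original analysis handled missing ages this way is an assumption. The manual calculation is included only to show what Pearson’s r measures.

```r
# Made-up month and age values; NA marks an unknown age.
month <- c(1, 3, 5, 7, 9, 11, 12, 2)
age   <- c(34, NA, 51, 22, 67, 45, 30, 58)

# Pearson correlation over the rows where both values are present.
r <- cor(month, age, method = "pearson", use = "complete.obs")

# Same quantity by hand: covariance scaled by both standard deviations.
ok <- !is.na(age)
x  <- month[ok]
y  <- age[ok]
manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(r)
print(manual)
```

Pearson’s r only captures linear association, and month is cyclical, so a value near zero rules out a straight-line trend rather than every possible pattern.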
I then decided to look at proportions of different categorical variables across the age groups. I used either side-by-side bar charts or stacked bar charts, depending on which was easier to interpret for each variable. Here are the plots for each variable, with suicides preceding accidental deaths each time.
Race:
Location:
Sex:
Season:
Education Level:
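The proportions behind stacked bar charts like these could be computed with a two-way table normalized within each age group. This is a sketch on made-up rows. The column names `age_group` and `race` are assumptions, and the original plots may have been built differently.

```r
# Made-up rows for illustration; column names are assumptions.
deaths <- data.frame(
  age_group = c("18 to 24", "18 to 24", "18 to 24", "18 to 24",
                "40 to 59", "40 to 59"),
  race      = c("Black", "White", "White", "Black", "White", "White"),
  stringsAsFactors = FALSE
)

# Rows are age groups, columns are races.
counts <- table(deaths$age_group, deaths$race)

# margin = 1 divides each row by its own total, so every age group sums to 1.
props <- prop.table(counts, margin = 1)

print(props)
```

Passing `t(props)` to `barplot()` would then draw one stacked bar per age group, with segments for each race.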
The first plot I want to highlight is the line graph of death counts for each age. I used both the suicide and accidental death plots as the basis for my project at Emory. The D3.js visualization highlighted some interesting areas of the plots, in order to emphasize the effects that guns have at different ages.
In my project, I highlighted the peak of the suicides graph at ages 45-62, along with a fact that I believe helps explain this peak. The text reads “Veterans have high firearm suicide rates.” Accompanying the text, there is a citation of a study from a literature review conducted by the Harvard T.H. Chan School of Public Health, found here (https://www.hsph.harvard.edu/hicrc/firearms-research/). I included this fact because I believe veteran suicide rates contribute at least partially to this peak.
In my project, I also highlighted certain areas of the accidental deaths graph. First, I pointed out that at age 0, there were already 11 accidental deaths. This is one of the main differences between this graph and the suicides one: the deaths start at a younger age. This detail can be attributed to many factors, including a lack of childproofing laws for guns or poor gun storage regulation. Another part of the graph I highlighted was the peak at ages 17-27, which is much younger than the peak ages of the suicides graph. The text that I included here read “Majority of firearm accidents occur under age 24. Most young people are shot by someone their own age.” This came from the same Harvard literature review. The reason I decided on these graphs for my D3.js visualization, instead of the bar charts of age groups, was to highlight the trends at individual ages, rather than forcing the viewer to make inferences based on somewhat arbitrary age divisions.
While I ended up choosing the plots above to focus on for my project, there were some other interesting plots I did not get to highlight. One of them is the stacked bar chart of deaths by race for each age group, specifically the one from the accidental deaths dataframe:
In contrast to the suicides graph (shown earlier in the report), in which white people represent about 75% or more of the deaths for each age group, this graph shows much higher proportions of black people killed at younger ages. There are a few possible factors that could have led to these higher proportions. My best guess is that this occurs because police officers kill young black people at disproportionately higher rates than young people of other races. While these killings are motivated by racial prejudices, they are often recorded as accidents, because the police officers responsible for these shootings consciously believe that they are accidents and report them as such. While I cannot claim definitively that this is the reason behind these differing results, the larger proportions certainly caught my attention, and this theory was the first thing to come to mind.
Although this R Markdown file was a stepping stone for a specific visualization, it was definitely an engaging process to discover some patterns that were more obvious within the dataset and some that were less obvious. The visualization I ended up creating for my project could not capture all of the interesting findings that came up in my exploration. There were some fundamental differences between the suicides and accidental deaths dataframes, but the two groups were also more similar than I expected at some points. Therefore, exploration was key to really understanding the dataset.
One challenge of working with this dataset, however, was that the majority of the variables were categorical, making it difficult to find relationships between variables. This caused me to work with counts of variables most of the time, which is why the majority of the visualizations are bar charts. This also forced me to work with age a lot, since it was the only quantitative variable that showed interesting (inferred) relationships with other variables. I am satisfied with the amount I explored the age and age group variables, and I feel that I learned a lot in the process. Nevertheless, this decision to focus on age really set the tone for the majority of the project, and I did not explore other variables much, except when it came to their relationship with age.
One area of future analysis could involve grouping the data by race or education the way I did for age groups, and seeing what interesting findings come up in those areas. Finally, I felt limited in that the dataset only had information on the victims of gun deaths and not on the shooters themselves. While my focus was on suicides and accidental deaths, I might have focused on homicides more if there had been information on the shooter.
In conclusion, while exploring this dataset was a rewarding process, it would be greatly beneficial if similar data were collected from the last few years with information on both parties involved in a shooting, and with new variables added in.